Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.

For the best experience please use the latest Chrome, Safari or Firefox browser.

Winter Palace at night, a boat in front.

MongoDB and Raft

How MongoDB replication follows and doesn't the Raft algorithm

Henrik Ingo
Senior Performance Engineer
PGDay Russia, 2017-07-09

@h_ingo

Agenda

Raft

English: lifeboat, floating device

Academic: Leader based distributed consensus algorithm
John Osterhout, Diego Ongaro (USENIX 2014)

In practice: Single-master synchronous replication protocol

Similar to Vievstamp replication. Alternative to Paxos.

https://raft.github.io/

Raft goals

Easy to understand

Easy to discuss trade offs

Complete and practical
(Actually works if implemented)

Raft's concept of time

Raft replicates a log. The state machine (aka database) is secondary.

At any given time, (at most) one node is leader.

Leaders are chosen with elections.

Time between elections is called term.

When observing new term, servers must immediately adopt it.

Example Raft log

indextermcommand
11a <- 5
21a <- 7
31b <- 1
42...
52...
64...
74...
84...
94...

Raft replication

Raft leader election

Raft limitations

Additional reading on Raft

MongoDB Replication

MongoDB replication

MongoDB replica sets (2010)

  • Similarities
    to Raft
    • Leader
    • Log
    • Initial sync
    • Heartbeats & Elections

Protocol Version 0 (1.6 - 3.0)

Adopt term concept

Replication counts as heartbeat

Highest priority node likely,
but not guaranteed to win

Freshness: Only majority commit guaranteed

Simplify: Vote for first qualified candidate

No veto!

Protocol Version 1 (3.2 - )

Related 3.2-3.4 enhancements

PV0 oplog entry


> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty()
{ "ts" : Timestamp(1444466011, 1),
  "h" : NumberLong("-6240522391332325619"),
  "v" : 2,
  "op" : "i",
  "ns" : "test.nulltest",
  "o" : { "_id" : 3, "a" : 3 }
}

unix time + counter

hash = gtid

PV1 oplog entry


> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty()
{ "ts" : Timestamp(1445632081, 861),
  "t" : NumberLong(42),
  "h" : NumberLong("5466055178864103715"),
  "v" : 2,
  "op" : "u",
  "ns" : "ycsb.usertable",
  "o2" : { "_id" : "user645414720104877157" },
  "o" : { "$set" : { "field4" : BinData(0,"KT5sN1wrPl...Ny9wIg==") }
} 

same, used as Raft index

Raft term

random integer = gtid

MongoDB replication summary

Additional reading

Questions?

A young woman trying on earrings.
Rembrandt
CC-BY jankruithof @ Flickr